Windows functions are great for visualizing data and extracting insights that cannot be gathered through grouby and multi groupby fuinctions.
Windows Overview
To further understand this we will use the following dataset to compare the differnece and function of groupby and windows functions. This is an example dataset that shows how many hours that employees clocked during a week (it was a slow week). The example code will also be given in pyspark.
employee | branch | Hrs |
---|---|---|
Jake Ballard | 1 | 10 |
Olivia Pope | 1 | 19 |
Fitz Gerald | 1 | 8 |
Cyrus Bean | 1 | 9 |
Melly Grant | 2 | 16 |
Harrison Right | 2 | 15 |
Elizabeth North | 2 | 9 |
Quin Perkins | 3 | 15 |
Aby Wheelend | 3 | 12 |
Markus Walker | 3 | 6 |
#### Groupby roupby functions opperate by summarizing data according to a specified group. I this example we will grouby the hours worked in each brand and then sumarize that data by taking the average of each group. This is often much faster than a windows due to the simplicity and is avaliability in most programing languages. A groupby would yield the following result from running the following code.
df.groupBy("branch").avg("Hrs").show(truncate=False)
branch | Avg Hrs |
---|---|
1 | 11.5 |
2 | 13.3 |
3 | 11 |
Windows
When preforming a windows function you can attain aggragate functions without loosing the the rows. This can be very useful when trying to create a new dataframe. An example of how this dataframe could provide more value than a groupby is if you wanted to display a double bar chart comparing the average hours worked in a branch to that of the hours worked by an individual employee. This function has two main parts, a partition and a relational grouped dataset. The first term is what it will partition the dataset into to create the new term/column. The second term can be one of couple of functions that can transform the data. You can then create a new column aggregating the function with that “window” that was created as demonstrated below. A windows functions yields the following result
Window.partitionBy("branch").orderBy("hrs")
df.withColumn("Avg Hrs",avg.over(windowSpec))
df.withColumn("row_number",row_number.over(windowSpec))
employee | branch | Hrs | Avg Hrs |
---|---|---|---|
Quin Perkins | 3 | 15 | 11 |
Aby Wheelend | 3 | 12 | 11 |
Markus Walker | 3 | 6 | 11 |
Olivia Pope | 1 | 19 | 11.5 |
Jake Ballard | 1 | 10 | 11.5 |
Fitz Gerald | 1 | 8 | 11.5 |
Cyrus Bean | 1 | 9 | 11.5 |
Melly Grant | 2 | 16 | 13.3 |
Harrison Right | 2 | 15 | 13.3 |
Elizabeth North | 2 | 9 | 13.3 |
History
The windows function originates as a SQL function, however has made it way into many other programming languages over the last couple of years. This function was introduced in 2003. Added functionalty was introduced later on.
Example Code
SQL
SELECT
RANK() OVER (PARTITION BY Value_1 ORDER BY Value_2 DESC)
AS Value_1_ranking,
Column_1,
Column_2,
Column_3,
Column_4
FROM df;
Pyspark
Window.partitionBy("value_1").orderBy("value_2")
Other Articles on Windows Function
Blog Articles
- BYUI Big Data Windows
- Apache Drill Article
- SQL Windows Functions
- Wikipedia on SQL
- Geeks for Geeks SQL Windows Function
https://www.geeksforgeeks.org/window-functions-in-sql/